A Fast Algorithm for Discovering Optimal String Patterns in Large Text Databases
Abstract
We consider a data mining problem in a large collection of unstructured texts based on association rules over subwords of texts. A two-words association pattern is an expression such as (TATA, 30, AGGAGGT) ⇒ C, which expresses the rule that if a text contains the subword TATA followed by the subword AGGAGGT at a distance of no more than 30 letters, then property C holds with high probability. The optimized confidence pattern problem is to compute frequent patterns (α, k, β) that optimize the confidence with respect to a given collection of texts. Although this problem can be solved in polynomial time by a straightforward algorithm that enumerates all possible patterns in time O(n^5), we focus on developing more efficient algorithms that can be applied to large text databases. We present an algorithm that solves the optimized confidence pattern problem in time O(max{k, m} n^2) and space O(kn), where m and n are the number and the total length of classification examples, respectively, and k is a small constant around 30 to 50. This algorithm combines the suffix tree data structure from combinatorial string matching with the orthogonal range query technique from computational geometry for fast computation. Furthermore, for most random texts such as DNA sequences, we show that a modification of the algorithm runs very efficiently in time O(kn log^3 n) and space O(kn). We also discuss heuristics such as sampling and pruning as practical improvements. We then evaluate the efficiency and performance of the algorithm with experiments on genetic sequences. A relationship with efficient agnostic PAC-learning is also discussed.
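To make the pattern semantics in the abstract concrete, here is a minimal Python sketch of how a single two-words association pattern (α, k, β) could be matched against a text and scored. It is a naive illustration, not the suffix-tree and range-query algorithm the abstract describes. The function names, the toy examples, and the reading of "distance" as the number of letters between the end of α and the start of β are all assumptions made for this sketch.

```python
def matches(text, alpha, k, beta):
    """Return True if alpha occurs in text and beta starts at most k
    letters after the end of that occurrence.

    Assumption: "distance" is the gap between the end of alpha and the
    start of beta, and beta may not overlap alpha.
    """
    start = text.find(alpha)
    while start != -1:
        end = start + len(alpha)
        # beta must start in [end, end + k], so it must lie fully
        # inside text[end : end + k + len(beta)].
        if text.find(beta, end, end + k + len(beta)) != -1:
            return True
        start = text.find(alpha, start + 1)
    return False


def support_and_confidence(examples, alpha, k, beta):
    """Score a pattern on (text, label) pairs.

    support    = fraction of texts the pattern matches
    confidence = fraction of matched texts whose label is True
    """
    matched = [label for text, label in examples
               if matches(text, alpha, k, beta)]
    support = len(matched) / len(examples)
    confidence = sum(matched) / len(matched) if matched else 0.0
    return support, confidence


# Hypothetical toy data: (text, has_property_C)
EXAMPLES = [
    ("CCTATAGGGAGGAGGTCC", True),
    ("TATAAGGAGGT", True),
    ("TATA" + "C" * 40 + "AGGAGGT", False),  # gap of 40 exceeds k = 30
    ("GGGGCCCC", False),
    ("TATACAGGAGGT", False),
]

SUPPORT, CONFIDENCE = support_and_confidence(EXAMPLES, "TATA", 30, "AGGAGGT")
```

Enumerating every candidate (α, k, β) and scoring each one this way is the kind of brute-force search that the abstract's O(n^5) baseline refers to; the sketch only shows what is being optimized, not how to do so efficiently.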
Similar resources
Efficient Discovery of Proximity Patterns with Suffix Arrays
We describe an efficient implementation of a text mining algorithm for discovering a class of simple string patterns. With an index structure, called the virtual suffix tree, for pattern discovery built on the top of the suffix array, the resulting algorithm is simple and fast in practice compared with the previous implementation with the suffix tree.
Block-Suffix Shifting: Fast, Simultaneous Medical Concept Set Identification in Large Medical Record Corpora
Owing to new advances in computer hardware, large text databases have become more prevalent than ever. Automatically mining information from these databases proves to be a challenge due to slow pattern/string matching techniques. In this paper we present a new, fast multi-string pattern matching method based on the well known Aho-Corasick algorithm. Advantages of our algorithm include: the abili...
Finding a Haystack in Haystacks - Simultaneous Identification of Concepts in Large Bio-Medical Corpora
Since nearly all information is now created digitally, large text databases have become more prevalent than ever. Automatically mining information from these databases proves to be a challenge due to slow pattern/string matching techniques. In this paper we introduce a new, fast multi-string pattern matching method called the Block Suffix Shifting (BSS) algorithm, which is based on the well kno...
Text Mining with Information Extraction
The popularity of the Web and the large number of documents available in electronic form has motivated the search for hidden knowledge in text collections. Consequently, there is growing research interest in the general topic of text mining. In this paper, we develop a text-mining system by integrating methods from Information Extraction (IE) and Data Mining (Knowledge Discovery from Databases ...
Text Data Mining with Optimized Pattern Discovery
This paper describes an application of the optimized pattern discovery framework to text and Web mining. In particular, we introduce a class of simple combinatorial patterns over phrases, called proximity phrase association patterns, and consider the problem of finding the patterns that optimize a given statistical measure in a large collection of unstructured texts. For this class of patterns, ...
Publication date: 1998